f\ n ir± I hi A I DA DCD VoL 28 no - 5 2012 ' pages 656-663 

\Jrtl\jl IVML. "A\~EH doi:10.1093/bioinformatics/bts028 



Sequence analysis Advance Access publication January 12, 2012 

Estimation of pairwise sequence similarity of mammalian 
enhancers with word neighbourhood counts 

Jonathan Goke 1 '*, Marcel H. Schulz 2 , Julia Lasserre 1 and Martin Vingron 1 '* 

1 Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, 
Germany and 2 Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 
PA 15213, USA 

Associate Editor: Martin Bishop 



ABSTRACT 

Motivation: The identity of cells and tissues is to a large degree 
governed by transcriptional regulation. A major part is accomplished 
by the combinatorial binding of transcription factors at regulatory 
sequences, such as enhancers. Even though binding of transcription 
factors is sequence-specific, estimating the sequence similarity 
of two functionally similar enhancers is very difficult. However, a 
similarity measure for regulatory sequences is crucial to detect 
and understand functional similarities between two enhancers and 
will facilitate large-scale analyses like clustering, prediction and 
classification of genome-wide datasets. 

Results: We present the standardized alignment-free sequence 
similarity measure N2, a flexible framework that is defined for 
word neighbourhoods. We explore the usefulness of adding reverse 
complement words as well as words including mismatches into 
the neighbourhood. On simulated enhancer sequences as well 
as functional enhancers in mouse development, N2 is shown to 
outperform previous alignment-free measures. N2 is flexible, faster 
than competing methods and less susceptible to single sequence 
noise and the occurrence of repetitive sequences. Experiments on 
the mouse enhancers reveal that enhancers active in different tissues 
can be separated by pairwise comparison using N2. 
Conclusion: N2 represents an improvement over previous 
alignment-free similarity measures without compromising speed, 
which makes it a good candidate for large-scale sequence 
comparison of regulatory sequences. 

Availability: The software is part of the open-source C++ library 

SeqAn (www.seqan.de) and a compiled version can be downloaded 

at |http://www.seqan.de/projects/alf.htmT] 
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1 INTRODUCTION 

Mammalian organisms consist of several hundred different cell 
types. Every cell has the same repertoire of genes; however, 
only a subset will be expressed to enable cell type-specific 
phenotypes. Many different factors regulate gene expression, 
of which genetically encoded transcriptional regulation seems 
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to play the major part dWilson et all l2008l) . Sequence-dependent 
gene regulation is mainly achieved through the binding of 
transcription factors at short DNA motifs. These transcription factor 
binding sites often occur in regulatory clusters in the genome, 
called c/s-regulatory modules (CRMs). Some CRMs can repress 
transcription, whereas others, referred to as 'enhancers', can enhance 
gene expression. Studies in Drosophila showed that the combination 
of binding sites together with the set of transcription factors 
actively rec r uited to a CRM determines its cell type spe cificity 
iGoto et all Il989l: ISmall et all 1 199 it Izinzen et all l20Q9h . More 
generally speaking, regulatory sequences with a similar binding site 
content can be expected to drive similar expression patterns. This is 
analogous to coding sequences, where sequence similarity has been 
used for many years to estimate functional similarity. The pairwise 
similarity of coding sequ e nces i s usually computed using global 
(iNeedleman and Wunschl Il970h or local dSmith and Watermanl 
Il98ll) alignments. This approach works well for sequences which 
are at least partially alignable; however, this is not the case for non- 
homologous CRMs. The location and orientation of binding sites in 
CRMs that show similar cell type- specific activity may differ widely, 
making it impossible to produce alignments. 

Alignment-free methods compare sequenc es according to their 
word content, see (IVinga and Almeidal 120031) for an overview. The 
initial purpose was to design a fast and accurate measure of pairwise 
(dis-) similarity that could be used in databases where traditional 
alignments were t oo slow telaisdellL[l985lCarpenter et a/.L 120021: 
iHide et a/.l[l994l) . In the meantime, alignment-fr ee methods have 
been applied in oth er contexts such as ph ytogeny 
and motif finding dGordan et all l2010l) . The idea to describe a 
sequence by its word content directly fits the model of CRMs, where 
we assume that a similar function is reflected in a similar binding 
site content. 

Word co unt-based methods have been used to compare regulatory 
sequences dKantorovitz et all 120071 : Ivan Helderll2004l) . However, 
these methods calculate the similarity of sequences based on exact 
word counts, whereas transcription factor binding sites are generally 
more flexible patterns. Furthermore, the genomic orientation of 
CRMs and of the binding sites within is most often unknown, 
highlighting the need to compare sequences according to the word 
counts on both strands simultaneously. As an example, the word 
w = CATAAT might be bound by the same transcription factor as the 
words CTTAAT and ATTATG, the former having one substitution, 
the latter being on the reverse strand. Exact word comparison 
methods consider these words dissimilar. More generally, let n(w) be 
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the set of words which are similar to w (the 'neighbourhood' of w). 
To overcome the limitation of exact word comparison methods, we 
need to develop a similarity measure that compares sequences based 
on word neighbourhoods. Theoretical approaches that consider 
approximate word match es have been studied before dBurden et al[ 
120081 ; iForet et fl/.Ll2006l) : however, no applicable method has been 
published for the purpose of pairwise comparison. 

In this study we define N2, an alignment-free comparison method 
that integrates all words in the neighbourhood of w. We compare N2 
to other alignment-free methods on simulated sequences and tissue- 
specific enhancer sequences identified in vivo in mouse embryos. 
The code and an executable file of the N2 similarity and other 
alignment-free methods present ed here is available a s part of the 
open-source C++ library SeqAn dPoering et all\200§) . 



We now calculate the normalized standardized neighbourhood count vector 

N s = — — 
" ll# S ll 

where ||-|| represents the Euclidean norm. We define the N2 similarity of 
two sequences as the inner product of their normalized standardized word 
neighbourhood count vectors: 

N2(S u S 2 ) = <N Sl ,N S2 > (3) 
= £#*x##. (4) 



2 METHOD 

2.1 The N2 similarity score 

Traditionally, the idea of alignment-free methods is to compare two 
sequences S\ and 52, of length l\ and h, based on the numbers of occurrences 
of all words w of length k over the alphabet X = {A, C , G , T}. Let A be the 
set of all such words w with \A\ being the total number of words (4 k in the 
case of DNA sequences). We associate a sequence S of length / with the 
word count vector 



AT S =(< ,<,..., A£ |/(| ), with 



l-k+l 



Ar, 



l(S[i...i+k-l]=w). 



(1) 
(2) 



To overcome the restriction to exact word counts, we extend Equation Q to 
word neighbourhood counts. We define the set of words in the neighbourhood 
of the word w as n(w). The neighbourhood may be defined appropriately for 
every application, for example, to fit transcription factor binding motifs, 
to allow for reverse complement word counts or to include mismatches. 
Integrating neighbourhood counts for every word w reduces the influence 
of w itself. This leads to word counts 'smoothing', i.e. inexact words are 
considered similar, and also to 'blurring', since inexact words might not be 
related. To control for these effects, we associate every word W in n(w) 
with a weight a w > which may differ for the considered application. We then 
compute the weighted word neighbourhood counts N n ( w ) for every word w 
of the sequence S: 



n: 



w f €n(w) 



Depending on the choice of n(w), N^ w ^ might be the sum of highly dependent 
variables since word occurrences of overlapping words such as C AAAA and 
AAAA A are strongly correlated. Additionally, the variance of individual word 
counts should be considered, since, for example, a high number of CAGCTG 
occurrences is more informative than a high count of self overlapping 
words such as AAAAAA where a Poly- A stretch of length 15 already gives 
10 occurrences. Furthermore, some words are more likely to occur than 
others, GC-rich words for example are less frequent in mammalian genomes 
than AT -rich words. We correct for inter- variable dependency, word count 
variances and word probabilities by standardizing the word neighbourhood 
counts: 



N s 



- E ^(w) ] 



Since the word counts might be dependent, the covariance of all words in the 
word neighbourhood has to be computed to obtain VL/V^^] (Section 2.2). 



As a consequence of the normalization, —1 <N2(S\ ,S 2 ) < 1, and S\=S 2 ^> 
N2(S\ ,S 2 ) = l, i.e. equal sequences will always have the maximum pairwise 
similarity of 1 . 



2.2 Calculation of expected value and variance 

The N2 score can be computed with Markov models of any order. Here, 
we illustrate the calculation of the expected value E[A/^ (w) ] and variance 
VfW^^] assuming a first-order Markov model. For clarity, the superscript 
indicator for sequence S is omitted in the following. Let the sequences 
be modelled by a first-order homogeneous stat ionary Markov chain with 
transition probabilities Tt(iJ) jRobin et <3/ll2005l) . The probability n(w) that 
a word w occurs at a specific position i depends on the probability that the 
first letter occurs, denoted /x(w[l]) (stationarity of the Markov chain) and 
can be calculated as follows: 



li(w) = fi(w[l]) x ]~[7r(w[/ - 1], w[/]) . 

j=2 



With this at hand, we ca n calculate the expe cted value E[N n ( w )] of the 
word neighbourhood counts JRobin "aZll2005h : 



E[N n(w) ]=E 



w' £n(w) 



= [AM, with 

w'en(w) 

E[N w ,] = (l-k+lMw f ). 



The variance is important to correct for the dependency of overlapping 
words in the word neighbourhood. The variance V[W W ( W )] of the word 
neighbourhood counts corresponds to the variance of the sum of the weighted 
word counts N w : 



V[N« W) ]=V J2 

w f €n(w) 
v' en(w)w" en(w) 
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Theco variance of word counts can be calculated according to lRobin et all 
<2005l) : 



Cov[N w ,N w r] = 
k-i 



e k - d (w,w f ) Y[ ^(w[/-l]V [/])-/x(w/) 
j=k-d+i 



€ k -d(w',w) ]"[ 7T(w[/'-l],w[/"])-/x(w) 

j=k-d+l 



+Ijl(w)h(w') (l-2k-t+2) 



L ^[i]) 

-(/-£ + l)/i(w)/x(w/). 
with 6 indicating word overlaps 



Mw[l]) 



(5) 
(6) 

(7) 

(8) 

(9) 

(10) 

(11) 
(12) 



e M (w,w') = 



lifvi#-ii + l...fc] = w'[l...M] 
0 otherwise. 



In the case where w = w', we have C ov [N w , Af w / ] =Y[N W ] . The word count 
variance can be calculated as follows jRobin et fl/ll2005l) : 



V[N w ] = (l-k + lMw)[l-fi(w)] 

k-i 

d=i 



€k-dM ]"[ 7T(w[/'-l],w[/"])-/x(w) 



Z-2*+l 

+2[/x(w)] 2 ^ (l-2k-t+2) 



1 



-7r'(w[fc],w[l])-l 



(13) 
(14) 

(15) 

(16) 

L .. J ,.. L ^ J/ . . (17) 

Terms {T7J and fill are costly to compute and have minor effects on the 
variance and covariance. In the following, we will therefore neglect those 
terms, thereby assuming that the occurrence of non-overlapping words is 
independent of the sequence in between (/x(w[l])~7T r (w[^],w[l])). 

2.3 Implementation and instances of N2 

The implementation th at we provide for N2 is part of the SeqAn library 
iDoering et all 120081) . It requires a set of sequences in .fasta format as 
input and returns a matrix with all pairwise similarity scores. The word 
length k (default k = 5) and the background model order (default 1) may 
be chosen manually and the normalized standardized word neighbourhood 
counts may be returned to obtain additional information on important words. 
The calculation of the scores is divided into two steps, a pre-processing step 
and a comparison step. 

The pre-processing step is run for every sequence individually. 
We estimate the background Markov model, count the words and calculate 
the word's probabilities and covariances. To avoid computing the full 
covariance matrix, only required entries are dynamically computed and 



stored. We then compute the standardized normalized word neighbourhood 
counts. The running time of this step depends on the length of the input 
sequences, the Markov model's order, the word length and the size of the 
word neighbourhood. It is linear in the number of input sequences. 

In the comparison step, the inner product of the standardized normalized 
word neighbourhood counts is computed for all pairs of sequences. The 
running time of this step depends on the word's length and is quadratic in 
the number of input sequences. 

The most basic instance of N2, with n(w) = w will be referred to as N2* . In 
our implementation, n(w) may be extended to include its reverse complement 
(rc), 

n rc (w) = {w,rc(w)} (18) 

all words equal to w with one mismatch (mm), 

n mm (w) = {w'\dist hamming (w,w') <= 1} (19) 

or the combination of both (mm, rc), where 

n mm , rc (w) = {w f ,rc(w')\dist hamm i ng (w, w') <= 1} . (20) 

In the following, we will refer to these instances as N2 rc , N2 mm , N2 mm > rc . 
The word count of w (and its reverse complement when selected) is always 
weighted with a w = \, for all other words w' in n(w) an alternative weight 
a w i may be chosen. The weights for mismatch neighbourhood counts are 
indicated in superscript, we use a w > = 1 (N2 mm ^- 0 ^) if not stated otherwise. 
Note that in Equations (19) and (20) our neighbourhood definition only 
covers direct neighbours, not neighbours of neighbours. 

2.4 Other methods 

The simplest score between two seq uences S\ and S 2 is obtained by 
calculating either the euclidean distance <BlaisdellL[l986l) or the inner product 
( iLippert et all |2002|) of the word count vectors N Sl and N Sl as defined in 
Equa tionfl). Both methods are called D2 and h ave been applied to biological 
data dCarpenter et all l2002t Imde et a/.U 19941) . Here we focus on the latter 
version using the inner product: 

D2(S u S 2 ) = <N Sl ,N S2 > 
wgA 

D2 is directly dependent on the length of the sequences, it can therefore not 
be used for comparing sequences of different length. 

The D2 z-score (D2z) was proposed t o obtain a standardized D2 score for 
which the significance can be estimated jKantorovitz et all\200lh : 



D2z(Si,S 2 ) = 



D2(SuS 2 )-E[D2(SuS 2 )] 
JV[D2(S U S 2 )] 



The expected value for D2 has been studied for approximate word matches, 
and u pper and lower bounds for the variance have been calculated ( Burden 



et al., 120081) . This work is largely of theoretical nature for Bernoulli 
background models and no implementation is provided, and therefore we 
could not integrat e this work into the a nalysis (Section 4). 

The D2* score iReinert et a/ll2009l) standardizes the word counts instead 
of the inner product. Similarly to N2, D2* is defined as the inner product 
of the standardized word counts as shown in Equation J3), but in this case 
n(w) only contains w itself, and the background model is computed on the 
concatenation of both sequences. 

Let ii(w) be the probability of w, the expectation of N% is then estimated by 
E[A^] = (/ — k+l)/jL(w). The authors assume a Poisson distribution, which 
implies that the variance is equal to the expected value. D2* was originally 
proposed with a Bernoulli background model for the computation of ijl(w). 
Here, we extended this score to use Markov background models of higher 
order. For the purpose of pairwise compariso n, the D2, D2* and D2 z scores 
have been implemented in the SeqAn library jDoering et fl/.ll2008l) and are 
part of the executable that is available online. 
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2.5 Estimating the background Markov model 

Calculation of the expected value and variance of the word counts assumes 
that we know the background model that describes the sequence. For N2, 
we estimate the background model separately for every sequence. This 
allows us to precompute word probabilities and variances (Section 2.3) 
leading to a great reduction in computational costs. Since CpG dinucleotides 
in mammahaiisejiojTn^ and 
Frommer, 1 19871) . a Bernoulli background model is insufficient to estimate 
word probabilities. This can be seen on simulations, where the first-order 
Markov model consistently outperforms the Bernoulli model across all 
methods (Supplementary Table SI). The optimal order for the Markov 
background model for enhancer sequences is an unknown function of 
organism complexity and sequence length. Due to the limited size of 
enhancer sequences, estimating higher order Markov models likely results 
in overfitting and poor estimates. Our analysis will therefore rely on a first- 
order Markov chain as background model for all methods throughout this 
analysis. 



Running lime, k=6 




1000 2000 3000 4000 5000 6000 
Number of Sequences 



2.6 Masking repeats 

Repeats such as SINE elements have a substantial influence on 
pairwise scores. We use the UCSC pre-masked genome sequenc e [hg!9, 
Repe atMasker (www.repeatmasker.org), TandemRepeatsFinder iBensonl 
Il999h l in order to hide those repetitive elements. Any repeat-masked 
sequence is split into a set of repeat-free subsequences by cutting out all 
repeat regions. Words are counted in this set such that no artificial words 
are created by concatenation. We use (number of counted words) +k— 1 as 
an estimation of the length of the repeat-free sequence. Repeat-masked 
sequences are treated equally for all methods. Note that this is slightly 
different to the original method proposed for D2z, which introduced artificial 
words by concatenating sequences. 

3 RESULTS 

3.1 N2 can be computed quickly 

Genome-wide datasets consist of many thousand regulatory 
sequences. The computation of pairwise similarities needs to be 
efficient for large-scale usage. We estimated the running time of 
each score on sets with various numbers of sequences where we 
computed the matrix of all pairwise similarities (quadratic number of 
scores computed). The methods show strong differences in practise 
(Fig-ID> but N2 and its variants are always faster than the other scores 
with a statistical model for realistically chosen numbers. Computing 
pairwise scores for 5000 enhancers with k = 6 takes 2h for N2* 
(4h for N2 rc , 20 h for N2 rc ' mm \ it takes about 42 h for D2* and 
91 h for D2z. 

The computation of N2 is dominated by the pre-processing 
step, which scales linearly in the number of sequences since the 
neighbourhood counts are calculated once for every sequence in 
advance (Fig.[T] Table[T] Section 2). In contrast, D2z andD2* cannot 
pre-compute normalized counts like N2, and scale quadratically 
in the number of sequences. D2z calcu lates z-scores on pairs o f 
sequences which are not pre-processed dKantorovitz et a/.[|2007b , 
and D2* calculates the background mo del on the concaten ation of 
sequences that cannot be pre-computed jReinertgfg/.Ll2009h . While 
this is likely to increase the accuracy of the model, running times 
are drastically higher. Computing pairwise scores for realistically 
large datasets is therefore nearly impossible for both D2z and D2*. 
This makes the N2 score very attractive for large-scale applications 
such as classification of regulatory sequences, or applications that 
support pre-computed data structures such as database searches. 



Fig. 1. Running time comparison. All pairwise scores were calculated for 
random sequences of length 1000 bp. 



Table 1. Running time of the different methods in O notation. 



Running time in O notation 


D2 


0(nl+n 2 4 k ) 


D2z 


(Kantorovitz et al, 2007) 


D2* 


0(n 2 (l+4 k +4 m )) 


N2 


0(rc(/+4 m +4 fc NeighbourhoodSize 2 )+rc 2 4 fc ) 



n: number of sequences; /: average sequence length; k\k-mQV size; m: Markov model 
order. The running time for D2* is dominated by the quadratic term. The running time 
for N2 is dominated by the linear term (pre-processing). 



3.2 N2 is robust against single sequence noise 

Ideally, the pairwise score between two sequences should reflect 
the sequences' similarity. However, in practise, word count-based 
methods can be heavily influenced by noise specific to individual 
sequences, mea ning that some sequences will intrinsically have high 
(or low) scores (iLippert et ali 120021 : Eeinert et a/.Ll2009h . Without 
proper correction, the pairwise score is an attribute of the individual 
sequence rather than of the pair of sequences. This is especially 
prominent for D2, where a high number of occurrences of a repetitive 
self-overlapping word (such as AAAAA) in one sequence will always 
induce high pairwise scores. To quantify the influence of single 
sequence- specific noise on pairwise scores, we studied the behaviour 
of D2, D2z, D2 and N2 for scoring pairs of unrelated sequences 
simulated by the same background model. We calculated scores for 
all sequence pairs (Si,Sj) for 500 such unrelated sequences. We 
chose a threshold t to select the top 5% highest scoring sequence 
pairs (high scoring pairs). For every sequence S(, we calculated the 
number of high scoring pairs Q: Ci = J2j^( SC0TQ (^i^j)^ t )- Since 
all sequences were generated by the same model, the expected value 
of Q, E(Q), is equal for all sequences S(. Here, 5% of the 499 
sequence pairs of Si are expected to have a score greater than t, 
thus E(Q) = 24.95. As as reference, we calculated C = {Ci,...,Q} 
when we randomly assign scores to sequence pairs. This method 
is not influenced by the sequence at all and therefore recapitulates 
the expected behaviour for the unrelated sequence pairs (Fig. [2] 
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Influence of sequence composition on pairwise scores 
for unrelated sequences 
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Table 2. Comparison of the different methods (k = 6) when the genomic 
orientation of the motif is unknown 

Performance with implanted &-mers, random strand 



Motif 



5%Precision 



AUC ROC 



AUC PR 





setting: mlr8 


m4r2 


mlr8 


m4r2 


mlr8 


m4r2 


D2 


0.88 


0.59 


0.72 


0.54 


0.72 


0.54 


D2z 


0.91 


0.64 


0.74 


0.56 


0.73 


0.56 


D2* 


0.87 


0.66 


0.71 


0.58 


0.70 


0.57 


N2* 


0.86 


0.65 


0.71 


0.58 


0.70 


0.57 


N2 rc 


0.93 


0.71 


0.77 


0.60 


0.77 


0.59 



Bold numbers indicate best performance. 



Fig. 2. Influence of single sequences on pairwise scores. All pairwise 
scores for 500 sequences generated by the same model were calculated. Q 
measures the number of sequence pairs for sequence Si among the highest 
5% of all scores (high scoring pairs). Since all sequences were created 
using the same model, the distribution of C = {C\, ...,C/} from alignment- 
free methods should be similar to the distribution of C obtained from a 
random scoring method ('expected', black line). A different distribution 
would indicate that the number of high scoring pairs is strongly dependent on 
the individual sequence, indicating that pairwise scores are dependent on the 
single sequence noise rather than on the similarity of the sequence pair. (A) 
Uniform nucleotide distribution, all methods show the expected behaviour. 
(B) AT -rich nucleotide distribution, D2 and D2z differ from the expected 
behaviour, showing that these pairwise scores are strongly influenced by the 
sequence composition. 



black line). We then calculated C for the four alignment-free 
sequence comparison methods. 

The distribution of C when N2* is used is close to the expected 
distribution for unrelated sequences (Fig. [2}. This shows that N2 
is robust against single sequence- specific noise as the numbers of 
high scoring sequence pairs are not influenced by the individual 
sequences (see Supplementary Figs SI and S2 for N2 rc and 
N2 mm,rc ). In contrast, D2 and D2z show a very different distribution 
of C from the expected behaviour in the non-uniform case. Figured 
shows that the number of high scoring pairs strongly varies, 
suggesting that the expected number for Q is different for every 
sequence Sj, even though all sequences were generated by the 
same model. This shows that the number of high scoring pairs 
detected with these methods is strongly influenced by the individual 
sequence, indicating that pairwise scores measure the individual 
sequence composition and not the similarity of the sequence pair. 
Prior work comparing regulatory se quences using al ignment-free 
meth ods did not consider this effect JDai et all 120081: Kantoro vitz 
et al. , 120071) . The above results confirm that neither the D2 nor the 
D2z- score should be applied to real biological sequences ( Lippert 
et al . 120021 : iReinert et all l2009h . 

Other sequence noise such as repeats and stretches of low 
complexity occurs frequently in genomic data. N2 is more robust to 
this type of noise than D2* and D2z due to its correction for word 
overlaps and normalization of counts (Supplementary Table S2). 
Our analysis suggests that N2 should be used when repeat-masking 
is not an option. 



3.3 Simulation studies 

To test the performance of N2 on simulated data, we randomly 
generated sequ ences with a similar dinucl eotide content as the 
mouse genome (iThomas-Chollier et a/.Ll201lh (mm9) as background 
sequences (negative set). We then implanted m randomly chosen 
motifs of length 5 r times into the same background sequences to 
simulate CRMs (' positive set'; mlr8: m = l, r = 8; m4r2: m = 4, 
r = 2). Following (iKantorovitz et al. 1 l2007h . we computed all 
pairwise scores for the corresponding negative and the positive 
sets. The pairwise scores from the negative and the positive sets 
were combined and ranked. Based on this ranked list, we evaluated 
the performance of the above methods for pairwise sequence 
comparison using the area under ROC curve (AUC ROC) and area 
under precision-recall curve (AUC PR). We further estimated the 
interpolated precision at 5% recall which we term 5% precision 
for short. Results show average values over 25 simulations, each 
time drawing 100 random sequences of length 1000 bp and inserting 
random motifs, thus covering different motif compositions in an 
unbiased way. We tested the performance counting words of size 
k = 6 using a first-order Markov model for word probabilities (see 
Supplementary Tables S3 and S4 for & = 5). 

We simulated two different settings to evaluate the performance 
of the neighbourhood concept of N2. First, we implanted randomly 
sampled 5mers into the forward and backward strand of the 
sequences to simulate the orientation independence of binding sites 
in CRMs. We specifically designed the N2 rc variant for this scenario 
and, indeed, N2 rc performs best (Table [2}. Second, we randomly 
sampled words and implanted these with one mismatch at a random 
position to simulate more flexible motifs. The N2 mm variant was 
designed for this scenario as it considers the word neighbourhood for 
the similarity. In these simulations, the N2 mm variant with mismatch 
weights a w = \.Q shows the best performance, demonstrating the 
value of neighbourhood counts to score sequences with approximate 
word matches (Table [3] see Supplementary Figs S3 and S4 for 
different choices of a w ). These simulations confirm the value of 
extending exact word count methods to word neighbourhoods. 

3.4 Pairwise comparison of tissue-specific enhancers 

The above simulations demonstrated the ability of N2 to distinguish 
artificial CRMs from unrelated sequences. Currently, our knowledge 
on regulatory sequences is limited and simulations can only 
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Table 3. Comparison of the different methods (k = 6) when motifs are 
sampled from all &-mers with one mismatch to the word 

Performance with implanted &-mers, mismatch 



5% Precision AUC ROC 



AUC PR 



Motif 





setting: mlr8 


m4r2 


mlr8 


m4r2 


mlr8 


m4r2 


D2 


0.59 


0.51 


0.53 


0.48 


0.53 


0.49 


D2z 


0.59 


0.54 


0.54 


0.51 


0.53 


0.51 


D2* 


0.60 


0.54 


0.54 


0.51 


0.54 


0.51 


N2* 


0.59 


0.54 


0.54 


0.51 


0.54 


0.51 


N 2mm(0.01) 


0.60 


0.54 


0.55 


0.51 


0.54 


0.51 




0.65 


0.55 


0.57 


0.52 


0.57 


0.53 



Bold numbers indicate best performance. 



Interpolated Precis ion- Recall Plot (k=6) 
Forebrain enhancer Q Limb enhancer 



"S 



\ 


D2z 
D2* 
N2* 











0.0 0.2 



0,4 0.6 
Recall 



0.8 1.0 




Fig. 3. Precision-recall curve for enhancers active during mouse 
development. The plots show the precision average over 25 samples each 
time drawing 500 enhancer sequences (positive) and 500 unrelated genomic 
sequences of equal length as the enhancers (negative). (A) Precision-recall 
curve for forebrain enhancers. (B) Precision-recall curve for limb enhancers. 



approximate the real nature of enhancers. Tissue- specific enhancers 
in mouse embryos have been identi fied in a genome-wide manner 
using the co-activator protein p300 felow et a/.L |2Q 1 0l : I Visel et a/1 
120091) . These datasets allow us to test whether alignment-free 
methods are able to discriminate in vivo identified enhancers that 
show similar activity from genomic background. We used enhancers 
active in forebrain, midbrain, li mb and heart tissue of the developin g 
mouse embryo as positive sets blow et fl/.Ll2O10l:IVisel et a/.Ll2009h . 
We compared pairwise scores from these tissue-specific enhancers 
with pairwise scores from genomic sequences of the same length 
randomly sampled from the mouse genome, ensuring a maximum 
of 30% of repetitive sequence for every negative sample. To obtain 
accurate estimations, we calculated the average over 25 samples, 
each time drawing 500 sequences from the positive set. Using the 
same evaluation measures as in the previous section, we tested the 
ability of alignment-free sequence comparison methods to detect 
functional similarity of regulatory sequences. 

The choice of parameters will influence the results obtained 
from alignment-free comparisons. For N2, the main parameters 
are the length of the &-mers k and the weights of the 
words in the neighbourhood (a w ). We therefore tested £ = 4,5,6 
and mismatch weights a w = {l, 0.75, 0.5, 0.25, 0. 1 , 0.05, 0.01 , 0.001} 
(Supplementary Figs S5-S8). This analysis indicates that a w should 
be larger for higher values of k where the expected number of &-mer 
occurrences is <1. Wh ile different parameters m ight improve results 
for different datasets (iKantorovitz et all 120071) , we selected k = 6 
and mismatch weights of 1 as reasonable parameters throughout the 
analysis to have a consistent and comparable setup. 

Figure [3] and Table 0] show the results for pairwise comparison 
of tissue-specific enhancers with alignment-free methods. Across 
all tissues, j\f2 mm ^-^^ rc gives the best results, demonstrating 
that N2 is most suitable to detect tissue-specific activity of 
regulatory sequences. The results also confirm the value of the 
word neighbourhood concept: comparing N2 rc with N2* shows 
that the neighbourhood extension to the reverse complement is 
always preferable (Table |4}. Extending the word neighbourhood to 
all words with one mismatch (A^2 mm ^ 10 ^' rc ) further improves the 
results by 6-15% (Table 0]). These results support the usage of N2 
with word neighbourhood counts to score the similarity of regulatory 
sequences. 



Tissue-specificity of enhancers. The above results indicate that 
tissue-specific enhancer sequences indeed have a similar word 
content. However, a comparison of ChlP-Seq data with randomly 
sampled genomic sequences might be biased towards measuring 
similarities introduced by the technology, such as similar GC content. 
To test this, we verified whether we can discriminate enhancers 
according to the tissue where they drive expression. For that purpose, 
we computed all pairwise scores of enhancers active in the same 
tissue (positive set) and all pairwise scores between enhancers active 
in other tissues (negative set), discarding all enhancers active in 
multiple tissues. To correct for length differences between datasets 
from different tissues, we selected 750 bp in the middle of the 
reported enhancer sequences. Figure [4] shows that tissue-specific 
enhancers can be discriminated by alignment-free methods (see 
Supplementary Fig S9 for the other datasets). While the performance 
decreases compared to using random sequences as the negative set, 
these results show that activity in a similar tissue is indeed reflected 
in a higher sequence similarity, gain, the neighbourhood extensions 
of N2 improves the results, further highlighting the value of this 
concept for regulatory sequences. 



4 DISCUSSION 

In this study, we showed that N2 improves alignment-free sequence 
comparison through its flexible extension to word neighbourhood 
counts, thereby covering approximate and orientation-independent 
word matches. Previously, the D2z score has been extended to allow 
for approximate matching words using estimates for the expectations 
and the variances based on a Ber noulli background model; however, 
no im plementation is available dBurden et all 120081 : iForet et all 
120061) . The framework that we present here is much more general 
and powerful. We allow for any desired word neighbourhood and 
associate words with weights such that the signal of words matching 
exactly is not lost. Furthermore, N2 can be computed on any 
background model order, which is essential to properly describe 
genomic sequences. Finally, N2 is much faster than D2z even 
without approximate matching, suggesting that a z-score calculation 
for an approximate D2 score would be infeasible for any dataset of 
realistic size. 
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Table 4. Comparison of the different methods on tissue-specific enhancers 



Performance on tissue-specific enhancer sequences 



5% Precision AUC ROC AUC PR 



Tissue FMLHFMLHFMLH 



D2 


0.61 


0.64 


0.55 


0.50 


0.55 


0.55 


0.50 


0.45 


0.54 


0.55 


0.51 


0.47 


D2z 


0.66 


0.69 


0.63 


0.56 


0.57 


0.57 


0.56 


0.53 


0.57 


0.57 


0.55 


0.52 


D2* 


0.71 


0.70 


0.67 


0.60 


0.62 


0.62 


0.59 


0.55 


0.60 


0.60 


0.58 


0.54 


N2* 


0.65 


0.64 


0.62 


0.58 


0.58 


0.57 


0.56 


0.53 


0.57 


0.56 


0.55 


0.53 


N2 rc 


0.71 


0.67 


0.68 


0.60 


0.61 


0.59 


0.58 


0.55 


0.60 


0.58 


0.58 


0.55 




0.84 


0.82 


0.79 


0.66 


0.66 


0.64 


0.63 


0.57 


0.66 


0.64 


0.63 


0.57 



Bold numbers indicate the best performance. Positive sequences were obtained by ChlP-Seq of p300 in forebrain (F), midbrain (M), limb (L) and heart (H) tissue of the mouse 
embryo. Negative sequences were randomly sampled from the mouse genome. All pairwise scores were computed with repeats masked, k = 6, background Markov model of order 1. 
Results show average values over 25 samples each time drawing 500 sequences. 



Interpolated Precision -Recall Plot (k=6) 
Forebrain vs. other enhancers 




0,0 0.2 0.4 0.6 0.8 1.0 
Recall 



Fig. 4. Precision-recall curve for forebrain enhancers in the mouse. 
Enhancers active in different tissues were used as the background set. 

The differences between N2* as used in this study and D2* are 
mainly due to the estimation of the background model. The better 
performance of D2* suggests that the concatenation of the sequences 
improves the accuracy of the background model; however, it 
drastically increases the running time. Here we observe that the 
improvement due to the extension to the word neighbourhood (N2* 
versus 7V2 mm,rc ) is better than the improvement due to different 
background model estimates (N2* versus D2*, Table|4]). 

With simulation studies we showed that N2 performs well on 
the task it was designed for, namely finding similarities between 
sequences based on shared words. Importantly, N2 is also able 
to measure similarity of in vivo identified enhancer sequences. 
This allows us to verify and increase our understanding of the 
architecture of regulatory elements: word count-based similarity 
measures are able to detect tissue-specific activity of enhancers, 
suggesting that CRMs contain scattered binding sites that contribute 
to their tissue specificity. Extending the word neighbourhood to the 
reverse complement (N2 rc ) improves the performance, showing that 
binding sites can occur on both strands of the CRM. Extending 
the neighbourhood to words with one mismatch ( j /V2 rc,mm ) further 
improves the performance on experimentally identified enhancers. 
This suggests that there are subtle signals like a common content 
of similar but not equal words which are characteristic of genomic 
enhancers. 



In this work, we assume that a high number of shared words 
represents a similar binding site content of enhancers. This 
assumption is violated by repeats, having a high number of shared 
words only due to high sequence similarity. For this reason, we 
mask repeats before calculating pairwise scores. Although some 
transcripti on factor binding sites have been found in repetitive 
sequences (iKunarso q/.U201Ql ; IZemoitel q/.[|2009L the sequence 
similarity of repeats is largely unrelated to regulatory activity and 
will eclipse any shared word count from common DNA binding 
motifs. We therefore recommend the usage of repeat masked 
sequences when comparing regulatory elements. 

The N2 similarity can be applied to other tasks than pairwise 
comparison. Alignment-fre e methods have been us ed to predict 
CRMs in flies and mouse (iKantorovitz et all l2009h . Our results 
on pairwise comparison of enhancers suggests that the N2 similarity 
could as well be used to predict the regulatory outcome of enhancers. 
In contrast to pairwise comparison, where we only rely on two 
sequences, prediction would allow to use training data, therefore 
we expect that the performance would improve for this task. 
Nevertheless, the large size of mammalian genomes limits prediction 
of regulatory sequences in a genome-wide manner due to an 
inevitable large number of false positive predictions. Among the 
applications where N2 might be very insightful are clustering and 
classification of regulatory sequences obtained from genome-wide 
studi es using transcription factor or co-activator binding data ( Lee 
et al, l201ll) . DNase hypersensitivity sites or enhancer- specific 
histone modifications. 



5 CONCLUSION 

In this study, we have presented N2, a novel alignment-free measure 
of sequence similarity that overcomes the limitations imposed by 
traditional exact word count-based methods. We have included 
the general concept of weighted word neighbourhood counts and 
shown that it improves the ability to detect similarity between 
regulatory sequences. The task of pairwise comparison of regulatory 
sequences is much harder than traditional pairwise alignment since 
only very few shared words might lead to a similar activity. We 
have demonstrated on a large-scale dataset of mammalian enhancer 
sequences that pairwise sequence similarity of non-homologous 
regulatory sequences is able to estimate similar in vivo activity. We 
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are therefore getting closer to understanding the sequence-dependent 
regulatory code within CRMs that enables the establishment of a 
large diversity of cell types coded in one genomic sequence. 

ACKNOWLEDGEMENTS 

We thank Sarah Behrens and Alena MysiCkova for their help on 
word statistics, Knut Reinert for helpful suggestions and discussions, 
and David Weese and Manuel Holtgrewe for their support with the 
SeqAn library. We additionally thank three anonymous reviewers 
for their helpful and constructive comments. 

FUNDING 

JG acknowledges funding from the International Max Planck 
Research School for Computational Biology and Scientific 
Computing (IMPRS-CBSC) and the Dahlem Research School 
(DRS). 

Conflict of Interest: none declared. 
REFERENCES 

Benson,G. (1999) Tandem repeats finder: a program to analyze dna sequences. Nucleic 

Acids Res., 27, 573-580. 
Blaisdell,B.E. (1986) A measure of the similarity of sets of sequences not requiring 

sequence alignment. Proc. Natl Acad. Sci. USA, 83, 5155-5159. 
Blow,M.J. et al. (2010) Chip-seq identification of weakly conserved heart enhancers. 

Nat. Genet., 42, 806-810. 
Burden,C.J. et al. (2008) Approximate word matches between two random sequences. 

Ann. Appl. Probab., 18, 1-21. 
CarpenterJ.E. et al. (2002) Assessment of the parallelization approach of d2-cluster for 

high-performance sequence clustering. /. Comput. Chem., 23, 755-757. 
Dai,Q. et al. (2008) Markov model plus k-word distributions: a synergy that produces 

novel statistical measures for sequence comparison. Bioinformatics, 24, 2296-2302. 
Doering,A. et al. (2008) SeqAn an efficient, generic C++ library for sequence analysis. 

BMC Bioinformatics, 9,11. 
Foret,S. et al. (2006) Asymptotic behaviour and optimal word size for exact and 

approximate word matches between random sequences. BMC Bioinformatics, 

7 (Suppl. 5), S21. 

Gardiner-Garden,M. and Frommer,M. (1987) CpG islands in vertebrate genomes. 
/. Mol. Biol, 196, 261-282. 



Gordan,R. et al. (2010) Finding regulatory dna motifs using alignment-free evolutionary 

conservation information. Nucleic Acids Res., 38, e90. 
Goto,T. et al. (1989) Early and late periodic patterns of even skipped expression are 

controlled by distinct regulatory elements that respond to different spatial cues. 

Cell, 57, 413-422. 

Hide,W. et al. (1994). Biological evaluation of d2, an algorithm for high-performance 

sequence comparison. /. Comput. Biol, 1, 199-215. 
Kantorovitz,M.R. et al. (2007) A statistical method for alignment-free comparison of 

regulatory sequences. Bioinformatics, 23, i249-i255. 
Kantorovitz,M.R. et al. (2009) Motif-blind, genome-wide discovery of cis-regulatory 

modules in drosophila and mouse. Dev. Cell, 17, 568-579. 
Kunarso,G. et al. (2010) Transposable elements have rewired the core regulatory 

network of human embryonic stem cells. Nat. Genet., 42, 631-634. 
Lee,D. et al. (2011) Discriminative prediction of mammalian enhancers from DNA 

sequence. Genome Res., [Epub ahead of print, doi:10.1101/gr.l21905.111, August 

29, 2011]. 

Lippert,R.A. et al. (2002) Distributional regimes for the number of k-word matches 
between two random sequences. Proc. Natl Acad. Sci. USA, 99, 13980-13989. 

Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for 
similarities in the amino acid sequence of two proteins. /. Mol. Biol, 48, 443-453. 

Reinert, G. et al. (2009) Alignment-free sequence comparison (i): Statistics and power. 
/. Comput. Biol. 

Robin,S. et al. (2005) DNA, Words and Models. Cambridge University Press. 
Small,S. et al. (1991) Transcriptional regulation of a pair-rule stripe in drosophila. Genes 
Dev., 5, 827-839. 

Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular 

subsequences. /. Mol. Biol, 147, 195-197. 
Thomas-Chollier,M. et al. (2011) RSAT 2011: regulatory sequence analysis tools. 

Nucleic Acids Res., 39, W86-W91. 
van Helden, J. (2004) Metrics for comparing regulatory sequences on the basis of pattern 

counts. Bioinformatics, 20, 399-406. 
Vinga,S. and Almeida, J. (2003) Alignment-free sequence comparison-a review. 

Bioinformatics, 19, 513-523. 
Visel,A. et al. (2009) Chip-seq accurately predicts tissue-specific activity of enhancers. 

Nature, 457, 854-858. 
Wilson,M.D. et al. (2008) Species-specific transcription in mice carrying human 

chromosome 21. Science, 322, 434-438. 
Wu,G.A. et al. (2009) Whole-proteome phylogeny of large dsdna virus families by an 

alignment-free method. Proc. Natl Acad. Sci. USA, 106, 12826-12831. 
Zemojtel,T. et al. (2009) Methylation and deamination of cpgs generate p53-binding 

sites on a genomic scale. Trends Genet., 25, 63-66. 
Zinzen,R.P. et al. (2009) Combinatorial binding predicts spatio-temporal cis-regulatory 

activity. Nature, 462, 65-70. 



663 



